Assignment 4

In the previous workbook, we built a classifier designed to pick between two specific digits, one that we called signal and the other background.

In this assignment, we will read in all of the digits and design a classifier that finds a specific digit (our signal again), while all of the other nine digits serve as the background. Our background will naturally be nine times bigger than our signal (unless we limit it).

Tasks

  1. You will need to shuffle or randomize the rows of the background data. I will provide code for reading the digits in, but they will be in batches of the same digit. To figure out how to do this, google "sklearn shuffle pandas".
  2. Come up with a method to limit the rows of the background data that you use, so that it is the same length (in rows) as the signal dataframe. You will want some easy way to turn this on and off. Run first with the background limited to the same length as the signal.
  3. After step 1, you will have a signal dataframe, and a randomized background dataframe. You will need to combine these two into a single dataframe. We have done this before, but you can look at pandas concat function. Use the name dfCombined for the combined dataframe.
  4. Next you will want to apply an estimator to this dataset. I want you to make a function that does both the test/train split, and then calls the estimator. The function should look like this
   * Make a training and testing dataset using the sklearn function **train_test_split** as we did in the previous workbook.
   * Use the sklearn estimator LinearSVC to fit the training data, and then predict results for both the test and training data.
  5. Next, you need to get the performance of the estimator on both the training and testing data. To do this, I want you to make a function to calculate various performance metrics and return the results. The function should look like this:

    def binaryPerformance(y, y_pred, y_score):

        # .... your code goes here

        return precision, recall, auc, fpr, tpr, thresholds

    In this method,

    * y = array of true labels
    * y_pred = array of predicted labels from the **predict** method of the LinearSVC estimator
    * y_score = array of scores from the **decision_function** method of the LinearSVC estimator
    * precision,recall,auc are the calculated values of these metrics
    * fpr, tpr, thresholds are lists containing the "false positive rate", "true positive rate", and "threshold"
  6. Call the "binaryPerformance" method for both the training and testing results for your estimator. How do they compare? Look at precision, recall, AUC, and the ROC curve.

  7. Run with the same classifier but now with the full background statistics.
  8. Run with a different classifier (and the large background statistics): the SGDClassifier. Compare your results to the LinearSVC classifier.


Useful Methods

Getting the data:

The data is in /fs/ess/PAS2038/PHYSICS5680_OSU/data/ch3/.

Defining Signal

At the top of this block we define which of the 10 digits we want to use for our signal.

Task1: Shuffle background

You will need to shuffle or randomize the rows of the background data. We already read the digits in, but they were in batches of the same digit, so we need to shuffle them to mix the digits up. To figure out how to do this, google "sklearn shuffle pandas". Note that the "shuffle" method from sklearn creates a copy of the original dataframe.
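A minimal sketch of the shuffle step (the dataframe contents here are toy stand-ins for the real digit data; the variable names are assumptions):

```python
import pandas as pd
from sklearn.utils import shuffle

# Toy background dataframe, rows grouped in batches of the same digit
dfB = pd.DataFrame({"label": [0, 0, 1, 1, 2, 2], "pixel0": range(6)})

# sklearn's shuffle returns a shuffled *copy*; the original dfB is untouched.
# random_state makes the shuffle reproducible.
dfB_shuffled = shuffle(dfB, random_state=42)

# reset_index gives the shuffled copy a clean 0..N-1 index
dfB_shuffled = dfB_shuffled.reset_index(drop=True)
```

The `random_state` argument is optional but makes reruns reproducible, which is handy while debugging.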

Task 2: Limit background

Come up with a method to limit the rows of the background data that you use, so that it is the same length (in rows) as the signal dataframe. You will want some easy way to turn this on and off. Run first with the background limited to the same length as the signal. Later you can come back and use all of the background data.

Call your limited background dataframe dfB_use.
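One simple way to build the on/off switch is a boolean flag (a sketch with toy dataframes; only the name `dfB_use` comes from the task, the rest are assumptions):

```python
import pandas as pd

# Toy stand-ins; real code would use the signal and shuffled-background
# dataframes from the earlier steps
dfS = pd.DataFrame({"x": range(10)})           # signal
dfB_shuffled = pd.DataFrame({"x": range(90)})  # shuffled background (9x larger)

limitBackground = True  # easy switch: True -> match signal length, False -> keep all rows

if limitBackground:
    # first len(dfS) rows of the already-shuffled background
    dfB_use = dfB_shuffled.head(len(dfS))
else:
    dfB_use = dfB_shuffled

print(len(dfB_use))  # prints 10
```

Because the background was shuffled first, taking the head of the dataframe still gives a random mix of digits.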

Task 3: Combining data

After steps 1 and 2, you will have a signal dataframe and a randomized background dataframe. You will need to combine these two into a single dataframe. We have done this before, but you can look at the pandas concat function. Use the name dfCombined for the combined dataframe.
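The combination step can be sketched like this (toy dataframes; the `isSignal` label column is an assumption about how the classes are marked):

```python
import pandas as pd

# Toy signal and limited-background dataframes with a class label column
dfS = pd.DataFrame({"x": [1.0, 2.0], "isSignal": 1})
dfB_use = pd.DataFrame({"x": [3.0, 4.0], "isSignal": 0})

# Stack signal on top of background; ignore_index gives a clean 0..N-1 index
dfCombined = pd.concat([dfS, dfB_use], ignore_index=True)
```

Without `ignore_index=True`, the combined dataframe would keep the (now duplicated) row indices of the two inputs.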

Task 4: runFitter Method

Next you will want to apply an estimator to this dataset. I want you to make a function that does both the test/train split, and then calls the estimator. The function should look like the following "skeleton". I show the expected inputs and the expected return values. Note we did all of this in the example workbook.
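The skeleton is not reproduced in this extract, but a minimal sketch might look like the following. The function name `runFitter` comes from the task title; the parameter names, the 70/30 split, and the return order are assumptions:

```python
from sklearn.model_selection import train_test_split

def runFitter(estimator, X, y, test_size=0.3, random_state=42):
    """Split the data, fit the estimator on the training set, and
    predict labels and decision scores for both train and test sets."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)

    estimator.fit(X_train, y_train)

    # Predicted labels for train and test
    y_pred_train = estimator.predict(X_train)
    y_pred_test = estimator.predict(X_test)

    # Continuous scores, needed later for the ROC curve
    y_score_train = estimator.decision_function(X_train)
    y_score_test = estimator.decision_function(X_test)

    return (y_train, y_pred_train, y_score_train,
            y_test, y_pred_test, y_score_test)
```

Returning both the train and test results makes the later train/test performance comparison a matter of calling the performance function twice.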

Task 5: Run the fitter

Now we can use the function we defined above. We have to define our estimator (which we get from sklearn) outside of the method and pass it as an argument to our defined function. We do it this way because later on we will want to call the method with a different estimator.
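The pattern looks like this. `runEstimator` below is a hypothetical stand-in for your fitting function, shown only to illustrate passing the estimator as an argument (the data is a toy example):

```python
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

def runEstimator(estimator, X, y):
    # Hypothetical stand-in for the fitting function: fit the passed-in
    # estimator and return its accuracy on the training data.
    estimator.fit(X, y)
    return estimator.score(X, y)

# Toy, trivially separable data
X = [[0.0], [0.1], [0.9], [1.0]]
y = [0, 0, 1, 1]

# Because the estimator is an argument, swapping it is a one-line change:
accSVC = runEstimator(LinearSVC(), X, y)
accSGD = runEstimator(SGDClassifier(random_state=0), X, y)
```

Any sklearn estimator with the usual `fit`/`predict` interface can be dropped in without touching the function itself.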

Task 6: Implement Performance Method

Next, you need to get the performance of the estimator on both the training and testing data. To do this, I want you to make a function to calculate various performance metrics and return the results. The function should look like this:

def binaryPerformance(y, y_pred, y_score):

    # .... your code goes here

    return precision, recall, auc, fpr, tpr, thresholds

In this method,

     * y = array of true labels
     * y_pred = array of predicted labels from the **predict** method of the LinearSVC estimator
     * y_score = array of scores from the **decision_function** method of the LinearSVC estimator
     * precision,recall,auc are the calculated values of these metrics
     * fpr, tpr, thresholds are lists containing the "false positive rate", "true positive rate", and "threshold"
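One possible implementation, sketched with `sklearn.metrics` (the specific metric functions chosen here are an assumption; any equivalent calculation is fine):

```python
from sklearn import metrics

def binaryPerformance(y, y_pred, y_score):
    """Compute binary-classification performance metrics."""
    # Precision and recall compare true labels against predicted labels
    precision = metrics.precision_score(y, y_pred)
    recall = metrics.recall_score(y, y_pred)

    # The ROC curve needs the continuous decision scores, not the labels;
    # the AUC is then the area under that curve
    fpr, tpr, thresholds = metrics.roc_curve(y, y_score)
    auc = metrics.auc(fpr, tpr)

    return precision, recall, auc, fpr, tpr, thresholds
```

Note the split: precision and recall use `y_pred` (from `predict`), while the ROC curve uses `y_score` (from `decision_function`).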

Task 7: Call performance method

Call the "binaryPerformance" method for both the training and testing results for your estimator.

Compare them in the following metrics:

  1. precision
  2. recall
  3. AUC
  4. Plot ROC curves for both
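One way to organize the two ROC curves for plotting is a single long-format dataframe with a column naming the sample. The scores below are toy stand-ins for the real train/test outputs, and the long-format layout is just one convenient choice (it happens to plot directly with plotly express):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# Toy labels/scores standing in for the estimator's train and test outputs
y_train = np.array([0, 0, 1, 1]); s_train = np.array([-1.0, 0.2, -0.1, 0.9])
y_test  = np.array([0, 1, 0, 1]); s_test  = np.array([-0.8, 0.5, 0.3, 0.7])

frames = []
for name, y, s in [("train", y_train, s_train), ("test", y_test, s_test)]:
    fpr, tpr, _ = roc_curve(y, s)
    frames.append(pd.DataFrame({"fpr": fpr, "tpr": tpr, "sample": name}))
dfROC = pd.concat(frames, ignore_index=True)

# Long-format data like dfROC plots directly, e.g.:
# import plotly.express as px
# px.line(dfROC, x="fpr", y="tpr", color="sample").show()
```

With both curves in one dataframe, the train/test comparison is a single overlaid plot rather than two separate ones.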

Task 8: A different Classifier

Run with a different classifier (and the large background statistics): the SGDClassifier. Compare your results to the LinearSVC classifier.

Compare for the test set, SGDClassifier vs LinearSVC:

  1. precision
  2. recall
  3. AUC
  4. Plot ROC curves for both
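Both LinearSVC and SGDClassifier expose `decision_function`, so the same performance code works for either without changes. A self-contained sketch of the comparison on toy data (the AUC shortcut `roc_auc_score` is an assumption; your `binaryPerformance` function gives the same number):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Toy two-class data standing in for the signal/background dataframes
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Loop over estimators; the surrounding code is identical for both
for est in (LinearSVC(), SGDClassifier(random_state=0)):
    est.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, est.decision_function(X_te))
    print(type(est).__name__, round(auc, 3))
```

This is the payoff of passing the estimator as an argument in Task 4: swapping classifiers changes one name, not the analysis code.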

Extra Credit:

  1. For a given signal digit, does the accuracy with which the background is rejected depend on what the background digit is? For example, given that our signal digit is 4, do you expect that the accuracy with which an 8 is identified as background is the same as the accuracy with which a 1 is identified as background? Probably not! To answer this:

    • Note that in training the above LinearSVC classifier we had the signal digit as 5.
    • When we ran our background through this classifier, the backgrounds were composed of digits 0,1,2,3,4,6,7,8,9 - all of the digits other than 5.
    • Count how often each of the background digits is in the test sample and store this (using a defaultdict to store this count would be a good idea!), and also how often each of those digits were correctly classified as background (using a separate defaultdict to store this count would also be a good idea!).
    • Take the ratio of these two and you get the "efficiency" to reject the background as a function of the background digits.
  2. Limit the signal to 1/10 of its maximum, and the total background to the same number (so signal and background have equal size). Compare the performance of the estimator using AUC. Is it worse than what we obtained above? Use LinearSVC for the estimator.
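The per-digit counting in the first extra-credit item can be sketched with two defaultdicts, as the bullets suggest (the toy lists below stand in for the real test-sample digits and classifier decisions):

```python
from collections import defaultdict

# Toy stand-ins: true digit of each background test row, and whether the
# classifier correctly called that row background (True) or not (False)
bkg_digits = [0, 1, 1, 8, 8, 8, 9]
called_bkg = [True, True, False, True, True, False, True]

nTotal = defaultdict(int)    # how often each digit appears in the test sample
nCorrect = defaultdict(int)  # how often that digit was correctly rejected

for digit, ok in zip(bkg_digits, called_bkg):
    nTotal[digit] += 1
    if ok:
        nCorrect[digit] += 1

# Ratio of the two counts: the rejection "efficiency" per background digit
efficiency = {d: nCorrect[d] / nTotal[d] for d in nTotal}
print(efficiency)
```

A defaultdict(int) avoids having to check whether a digit has been seen before incrementing its count.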

Look at the digits in the background

Rank the above by accuracy

Impact of less data

When the data is cut, the AUC is smaller than with the larger data set. Thus, the performance of classifying our signal is worse with less data.